Genetic Epidemiology — Latest Matching Preprints

1

Conditional and marginal SNP-heritability to leverage ancestral and environmental diversity

Singh Sachan, A. N.; Schwartzman, A.; Azriel, D.

2026-05-29 genetics 10.64898/2026.05.28.728536 medRxiv

Top 0.1%

21.9%

SNP-heritability is defined as the fraction of variance of a trait that is explained by the SNPs in a genome-wide association study. Several methodologies have been proposed to estimate this quantity. More recent methods aim to do so with ancestrally diverse datasets and yet obtain a single heritability for an entire dataset, which we refer to as marginal heritability. However, the different underlying subpopulations that compose a genetically diverse dataset might have different environmental and genetic exposures, and thus may have different heritabilities. To address this, we propose a conditional SNP-heritability approach that allows estimation of multiple SNP-heritabilities on a dataset, corresponding to different ancestral compositions and environmental exposures. We take a careful statistical approach, including estimation of conditional genetic and environmental variances, and calculation of standard errors via a combination of the delta method with bootstrapping. We validate our method via extensive simulations. We then apply it to an ancestrally and socio-economically diverse dataset of 6603 subjects aged around 9 to 11 from the Adolescent Brain Cognitive Development study, and illustrate how the SNP-heritability of intelligence scores can change due to differing extrinsic variances in different socio-economic groups, which coincides with previous work in the literature. This conditional estimation approach can be a valuable tool for understanding differences in risks across subpopulations. Our work here improves on existing methodology and allows us to leverage the heterogeneity of the data to obtain new insights.
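
The delta-method step mentioned in this abstract can be illustrated with a minimal sketch. It assumes heritability is the ratio h2 = var_g / (var_g + var_e) and that a 2x2 covariance matrix of the two variance estimates is already available (for instance from bootstrap replicates). The function name, arguments, and numbers are hypothetical and are not taken from the preprint.

```python
import math

def h2_with_delta_se(var_g, var_e, cov):
    """Heritability h2 = var_g / (var_g + var_e) and its delta-method SE.

    cov is the 2x2 covariance matrix of the estimates (var_g, var_e);
    in this sketch it is assumed to come from bootstrap replicates.
    """
    total = var_g + var_e
    h2 = var_g / total
    # Partial derivatives of h2 with respect to var_g and var_e.
    grad_g = var_e / total ** 2
    grad_e = -var_g / total ** 2
    # First-order (delta method) variance: gradient' * cov * gradient.
    var_h2 = (grad_g ** 2 * cov[0][0]
              + 2.0 * grad_g * grad_e * cov[0][1]
              + grad_e ** 2 * cov[1][1])
    return h2, math.sqrt(var_h2)

# Illustrative numbers only.
h2, se = h2_with_delta_se(1.0, 3.0, [[0.04, 0.0], [0.0, 0.09]])
print(f"h2 = {h2:.3f}, SE = {se:.4f}")
```

In a conditional analysis, the same calculation would presumably be repeated for each set of conditional variance estimates.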

2

Estimating uncertainty in family-based GWAS

Miao, X.; Edge, M. D.; Harpak, A.

2026-05-14 genetics 10.64898/2026.05.11.724392 medRxiv

Top 0.1%

14.2%

Standard genome-wide association studies (GWASs) are vulnerable to confounding factors, including stratification, assortative mating, and dynastic effects. Family studies such as sibling-based GWAS (sib-GWAS) mitigate such confounding and are becoming the tool of choice for teasing apart direct genetic effects (causal effects of one's genotype on one's own phenotype) from other factors. However, due in part to their smaller sample sizes, sib-GWAS allelic effect estimates are substantially more variable than standard (i.e., population-based) GWAS estimates. The quantification of this uncertainty is essential for many uses of sib-GWAS, including polygenic scoring, causal inference (e.g., Mendelian randomization), disentangling direct from indirect familial effects, and measuring assortative mating. Here, we investigate sources of uncertainty in sib-GWAS allelic effect estimators. We study their impacts on the biases of three uncertainty measurement methods, including two that are commonly used and a new resampling-based approach we propose. We find that heterogeneity in allelic effects or heteroskedasticity across families (e.g., due to variation in genetic backgrounds or environments) can bias existing methods, and that this bias is more severe for small samples and rare variants. In contrast, the resampling-based approach we propose is approximately unbiased under all scenarios we considered. We validate our theoretical predictions, as well as the importance of effect heterogeneity and heteroskedasticity, using simulations and empirical analysis in the UK Biobank. In sum, this study helps understand the sources of uncertainty in family-based genotype-phenotype association studies and provides a robust method to estimate uncertainty.

3

When can whole-genome SNP heritability be reliably estimated from summary statistics?

Pham, B. K.; Davenport, S.; Azriel, D.; Schwartzman, A.

2026-05-16 genetics 10.64898/2026.05.13.724972 medRxiv

Top 0.1%

10.5%

LD Score Regression (LDSC) is a prominent method, which estimates whole-genome SNP heritability from summary statistics via the slope of a linear regression of GWAS test statistics corresponding to a trait of interest against LD scores. It was claimed by the LDSC authors that the free intercept in the regression accounts for confounding bias such as population stratification. In this study, we argue that the intercept in LDSC must be fixed to 1 for accurate SNP heritability estimation. We show both theoretically and with simulations that the estimated intercept does not accurately capture population stratification effects, and that it adversely affects the accuracy of the heritability estimate introducing bias and increasing variance. Fixing the intercept to 1 eliminates bias and reduces variance when no population stratification is present. On the other hand, under population stratification, LDSC is biased with both the free and the fixed intercept. Additionally, we show that estimated standard errors in LDSC are underestimated, potentially leading to false-positives in downstream GWAS analyses.
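
The free-versus-fixed intercept distinction can be made concrete with a minimal sketch. It assumes the basic LDSC relation E[chi2_j] = a + (N * h2 / M) * l_j, with a = 1 when there is no confounding, and uses plain unweighted least squares. The real LDSC software uses weighted regression and a block jackknife, which are omitted here. The function name and the noise-free toy data are hypothetical.

```python
def ldsc_h2(chi2, ld_scores, n_samples, n_snps, fix_intercept=True):
    """Unweighted LD score regression sketch.

    Model: E[chi2_j] = a + (N * h2 / M) * l_j, so h2 = slope * M / N.
    Returns (h2 estimate, intercept used or estimated).
    """
    if fix_intercept:
        # Regress (chi2 - 1) on the LD score through the origin.
        num = sum(l * (c - 1.0) for l, c in zip(ld_scores, chi2))
        den = sum(l * l for l in ld_scores)
        slope, intercept = num / den, 1.0
    else:
        # Ordinary least squares with a free intercept.
        m = len(chi2)
        mean_l = sum(ld_scores) / m
        mean_c = sum(chi2) / m
        sxy = sum((l - mean_l) * (c - mean_c) for l, c in zip(ld_scores, chi2))
        sxx = sum((l - mean_l) ** 2 for l in ld_scores)
        slope = sxy / sxx
        intercept = mean_c - slope * mean_l
    return slope * n_snps / n_samples, intercept

# Noise-free toy data generated from the model with h2 = 0.4 and a = 1.
N, M, H2 = 50_000, 1_000, 0.4
ld = [1.0 + 0.5 * (j % 40) for j in range(M)]
chi2 = [1.0 + N * H2 / M * l for l in ld]
h2_fixed, a_fixed = ldsc_h2(chi2, ld, N, M, fix_intercept=True)
h2_free, a_free = ldsc_h2(chi2, ld, N, M, fix_intercept=False)
print(h2_fixed, h2_free, a_free)
```

On noise-free data both variants recover the same heritability; the differences in bias and variance discussed in the abstract only appear once sampling noise or stratification is added to the test statistics.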

4

Functionally informed annotation influences pathway-specific polygenic risk and disease inference in Alzheimer's disease

Bazemore, K.; Iqbal, T.; Kuzma, A. B.; Grant, S. F. A.; Schellenberg, G. D.; Wang, L.-S.; Chesi, A.; Jin, J.; Naj, A. C.

2026-05-26 epidemiology 10.64898/2026.05.25.26353905 medRxiv

Top 0.1%

9.0%

Pathway-specific polygenic risk scores (pathway-PRS) measure aggregate genetic risk across single nucleotide variants (SNVs) annotated to genes in a pathway of interest. In most applications, SNV-to-gene annotation is based on SNV position with respect to gene boundaries. This approach is ill-suited for incorporating non-coding SNVs, which can regulate gene expression over long distances and represent a large proportion of risk variants for Alzheimer's disease (AD). Here, we compare the performance of AD pathway-PRS across SNV-to-gene annotation strategies that integrate varying levels of functional genomic data, including adult brain chromatin interaction and expression quantitative trait loci (eQTL) data. In the UK Biobank (n=328,526), including AD cases defined by ICD-9/10 codes (n=3,043) and by family history of AD/dementia (n=38,589), we show that the annotation strategy integrating chromatin interaction and eQTL data consistently improves pathway-PRS performance. We replicate this finding in independent data from the Alzheimer's Disease Genetics Consortium (n=3,370). We further find that pathway-PRS associations with AD vary by annotation strategy and that power to detect sex-dependent and age-at-onset associations is increased with integrative annotation. Together, these findings support the use of functionally informed SNV-to-gene annotation for pathway-PRS construction and highlight the importance of applying multiple annotation strategies for robust inference.

5

A General Statistical Framework for Hardy-Weinberg Equilibrium Inference on the X Chromosome

Zhang, L.; Paterson, A. D.; Sun, L.

2026-05-20 genetics 10.64898/2026.05.17.725730 medRxiv

Top 0.1%

8.2%

Testing for Hardy-Weinberg equilibrium (HWE) is a fundamental component of genetic data analysis, widely used for quality control and model validation. Although HWE testing is well established for autosomal loci, inference on the X chromosome is more complex due to sex-specific genotype structures and potential sex differences in minor allele frequency (sdMAF). Existing tests differ in their assumptions about sdMAF and male sample inclusion, often leading to distinct but poorly characterized null hypotheses. We develop a general statistical framework for HWE inference using the robust allele-based regression model. By formulating HWE testing as an assessment of allele-level dependence, the framework directly parameterizes Hardy-Weinberg disequilibrium, unifies existing Pearson χ²-based tests under explicit modeling assumptions, and clarifies their null hypotheses, degrees of freedom, and sensitivity to sdMAF. The framework also accommodates covariate and population-structure adjustment within a unified regression-based formulation. The proposed framework provides robust, interpretable, and flexible inference, establishing a unified statistical foundation for HWE testing across autosomal and X-chromosomal regions. Simulation studies and analysis of high-coverage 1000 Genomes Project data demonstrate that commonly used X-chromosome tests can exhibit inflated type I error or misleading inference when sdMAF is present.
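
For orientation, the classical autosomal Pearson chi-square HWE test that this abstract takes as its starting point can be sketched as follows. This is the textbook 1-degree-of-freedom test on genotype counts, not the authors' regression framework, and it does not handle the X chromosome or sdMAF. The function name and the example counts are illustrative.

```python
import math

def hwe_chi2_autosomal(n_aa, n_ab, n_bb):
    """Classical Pearson chi-square test of HWE at a biallelic autosomal locus.

    Takes the three genotype counts and returns (statistic, p-value on 1 df).
    Assumes both alleles are observed, so no expected count is zero.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of the first allele
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

stat, p_value = hwe_chi2_autosomal(30, 40, 30)
print(f"chi2 = {stat:.2f}, p = {p_value:.4f}")
```

With counts (30, 40, 30) the allele frequency is 0.5, the expected counts are (25, 50, 25), and the heterozygote deficit gives a statistic of 4.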

6

Direct and mediated effects (DME) SLCMA: a novel method for life course modelling with time-varying covariates

Beer, S.; Simpkin, A. J.; Eldeeb, S. Y.; Zar, H. J.; Stein, D. J.; Dunn, E. C.; Smith, A. D. A. C.

2026-06-06 epidemiology 10.64898/2026.05.29.26354427 medRxiv

Top 0.2%

4.0%

Background: In prospective cohort studies, where an exposure is collected repeatedly, interest often lies in determining whether the timing of that exposure has a differential effect on a later outcome. The Structured Life Course Modeling Approach (SLCMA), where users select between temporal hypotheses of exposure specified a priori, provides one way to analyse such longitudinal data. However, few studies using SLCMA consider the effect of time-varying covariates (TVC) which may impact associations. Methods: We present a modified version of the SLCMA - called direct and mediated effects (DME)-SLCMA - which corrects for TVC. We first develop the DME-SLCMA method, test it through simulation, and apply it to psychosocial data from the Drakenstein Child Health Study (DCHS, n=336) to investigate relationships between maternal psychopathology, TVC of socioeconomic status, and offspring depressive symptoms. Results: We found that, on average, offspring depressive symptoms score increased by 3.9% (95% CI: 1.0%-6.9%, p = 0.039) for each unit of maternal psychopathology (SRQ) at 48 months whilst adjusting for time-varying socioeconomic status (at 18, 30, 42 and 54 months). Our simulations identified several realistic scenarios where selections ignoring TVC - with TVC mediated exposure effects present - were prone to be incorrect, including our DCHS example. Conclusion: DME-SLCMA is a robust new approach for life course modelling in the presence of time-varying covariates. We recommend adjusting for TVC whenever possible, and, when not possible, our simulation study identified that scenarios where mediated effects are comparable, or greater, in magnitude to direct effects are most prone to confounding.

7

Genomic-Relatedness Matching Expands Population Coverage, Improves Power, and Reduces Bias in Genetic Association Analyses

Jaishankar, D.; Gjorgjieva, T.; Jala, J.; Swigert, J.; Young, A. S.; Benjamin, D. J.; Cesarini, D. A.; Turley, P.

2026-05-18 genetic and genomic medicine 10.64898/2026.05.14.26353140 medRxiv

Top 0.2%

3.7%

We introduce a novel approach, Genomic-Relatedness-Matched Association (GRMA) studies, as an alternative to genome-wide association studies (GWAS). GWAS are typically restricted to samples of mostly unrelated individuals with a single, shared continental ancestry and nevertheless can still be biased by gene-environment correlation and assortative mating. In contrast, GRMA can be implemented in ancestrally diverse samples--retaining individuals of mixed or underrepresented ancestries and eliminating the need to assign labels to ancestry groups--and can reduce bias relative to standard GWAS. GRMA matches each individual to a group of controls whose pairwise relatedness with the individual exceeds a user-specified threshold. It generates SNP-level summary statistics based on within-group associations. In applications using the UK Biobank and All of Us data, we find that GRMA compares favorably to GWAS methods in terms of bias, precision, and population coverage. GRMA enables several novel findings; for example, we find that "genetic nurture" is unlikely to be an important source of genome-wide bias in population GWAS of body mass index, height, and educational attainment. The method is computationally efficient and supported by open-source software, facilitating its application in large-scale scientific and health-related studies.

8

Detecting genomic regions enriched for reciprocal recombination in autism spectrum disorder

Mahoney, C. F.; Salter-Townshend, M.; Fitzpatrick, D. J.; Shields, D. C.

2026-05-27 genetics 10.64898/2026.05.26.727863 medRxiv

Top 0.2%

3.5%

Meiotic recombination is an important means of increasing genetic diversity by generating novel haplotypes in a population. Recombination separates linked loci extremely slowly in some regions, therefore genetic variants in high linkage disequilibrium may become co-adapted. Reciprocal recombination that separates co-adapted variants may generate a deleterious de novo haplotype that contributes to disease. We developed statistical methods to detect genomic regions of recombination excess in two different family-based study designs. We identified recombination in the Simons Simplex Collection in 273 simplex families with one child with autism spectrum disorder (ASD) and at least two unaffected children, in which recombinations can be mapped to the proband and contrasted with the recombination counts in unaffected siblings; and in 1,802 families with two children, where the number of recombinations identified can be contrasted with the expectation from a reference recombination map. Both strategies revealed a tail of low p-values for loci of interest that contrasted with the rest of the distribution. Permutation and bootstrap tests did not identify genome-wide primary findings in either cohort, but the most significant three-child cohort locus of recombination excess (between cadherin genes CDH4 and CDH26) replicated in the two-child cohort (p=0.01). While this replication strategy was not defined a priori, five of the most recombination enriched bins identified candidate ASD genes (p=0.02; WWOX, ADAMTS16, INSR, ADARB2, and HS6ST1). Since the six identified loci were not identified as regions of high de novo copy number variation in the study cohort and no CNVs were detected in any of the recombinant probands in the identified regions, they represent candidates for reciprocal recombinations generating unfavourable haplotypes for these genes. This study highlights a previously unidentified source of clinical genetic variability contributing to the molecular aetiology of ASD.
Author summary: Autism spectrum disorder (ASD) is a constellation of neurodevelopmental disabilities characterised by deficits in social communication and repetitive patterns of behaviour. While ASD is highly heritable, its genetic basis is complex and poorly understood. While some highly penetrant types of genetic variation have been identified, most people with ASD carry a large number of variants that each contribute a small amount to their overall phenotype. In addition to mutations in individual genes, changes in the configuration of genes along a chromosome may contribute to ASD. Here, we describe a method for identifying regions where such new configurations have occurred through recombination and attempt to find regions where such changes are more common in autistic children than in their non-autistic siblings. We explore recombination as a source of genetic variation contributing to autism, which has potential to inform clinicians in providing services to autistic people and their families.

9

CN-RNN: a Deep Learning Framework for Copy Number Variation Detection with Exome Sequencing Data

Wang, D.; Qin, F.; Bao, W.; Bacher, R.; Chung, D.; Lu, Q.; Efron, P. A.; Cai, G.; Xiao, F.

2026-05-15 genetics 10.64898/2026.05.13.724920 medRxiv

Top 0.3%

2.5%

Copy number variations (CNVs) are major structural genomic variants that contribute to a wide range of human diseases. Accurate detection of CNVs from whole-exome sequencing (WES) data has been a long-sought goal for clinical and population genetic studies. Despite recent progress, existing WES-based CNV callers still suffer from high false-positive rates and reduced recall for short-length variants, and current deep learning methods have not fully used complementary information in region-level genomic features. Here we present CN-RNN, a deep learning-based CNV caller for WES data. The model combines a bidirectional long short-term memory (BiLSTM) branch that captures local depth changes and contextual dependencies across neighboring exons with a parallel multi-layer perceptron (MLP) branch that encodes region-level metadata such as GC content, mappability, and exon length. CN-RNN was trained on the Autism Sequencing Consortium (ASC) parent-child trio cohort using the Mendelian rule of inheritance to ensure high-quality training sets. It was evaluated across three independent datasets, in which we showed that CN-RNN outperformed existing WES-based CNV callers and deep learning methods. CN-RNN offers a scalable, accurate tool for CNV profiling in WES-based studies and supports broader application of CNV analysis in population and clinical research. CN-RNN is available at https://github.com/FeifeiXiao-lab/CN-RNN.

10

Leveraging cis- and trans-variants to improve protein expression level prediction for proteome-wide association studies

Dong, R.; Lamb, D.; Wang, G.; DeWan, A.; Leal, S. M.

2026-05-28 genetics 10.64898/2026.05.28.728201 medRxiv

Top 0.3%

2.5%

Since genetic effects are often mediated through proteins, the analysis of proteomic data can provide insights into disease etiology. However, most studies lack proteomic data. To address this problem, we developed TransCisPredict to perform proteome-wide association studies (PWAS) at a biobank scale. TransCisPredict reduces computational burden through linkage-disequilibrium block selection which facilitates incorporating cis- and trans-variants to predict protein expression and performs protein-phenotype association analyses. To account for differences in protein regulatory architecture, four prediction methods are used for weight estimation, i.e., BayesR, Elastic Net, LASSO, and SuSiE. Five-fold cross-validation (CV) is used to select the optimal method for each protein. Weight estimation was performed using White British UK Biobank study subjects (N=42,644) with proteomic and genotype array data. Of the 2,920 available protein expression levels, 2,339 could be predicted with a CV-R² > 0.05 when cis- and trans-variants were used. Since most methods are limited to cis-variation, for comparison only cis-variants were used for prediction yielding 466 proteins with a CV-R² > 0.05. A PWAS was performed for 2,339 predicted protein expression levels and type 2 diabetes (T2D) using White British UK Biobank study subjects without proteomic data (N=364,132) followed by two-sample Mendelian randomization using a method that controls for horizontal pleiotropy for validation. Forty proteins were associated with T2D and validated. For the 466 cis-only predicted protein expression levels, three proteins were associated with T2D and validated. Incorporating both cis- and trans-variation using TransCisPredict facilitates the prediction of many more proteins compared to using cis-only variants thereby increasing the power of PWAS.

11

Prioritizing embryos with lower homozygosity may reduce disease risk in children of related individuals undergoing preimplantation genetic testing

Wolfram, T.; Ahangari, M.; Davidson, I.; Wartschinski, L.; Li, J. H.; Eyre, M.; Stern, D.; Schleede, J.; Haghighi, A.; Carmi, S.; Christensen, M.

2026-06-04 genetic and genomic medicine 10.64898/2026.05.30.26354526 medRxiv

Top 0.3%

1.9%

Consanguinity is a reproductive union between individuals who share a recent common ancestor. These unions are common in many regions of the world and increase the burden of rare recessive disorders by elevating autozygosity in offspring. Current reproductive genetic screening focuses on a limited set of known pathogenic variants, leaving most recessive risk unaddressed. Here we argue that embryo-level autozygosity, quantified as the fraction of the genome in long runs of homozygosity (FROH), is a potentially actionable genomic biomarker that can be integrated into routine preimplantation genetic testing as a homozygosity-informed embryo-prioritization framework (PGT-H) that can be layered onto existing embryo biopsy workflows when couples are already undergoing IVF with PGT-A or PGT-M. Using forward simulations of first-cousin and double-first-cousin couples, we show that siblings conceived by the same couple span a wide range of FROH; selecting the lowest-FROH candidate from a cohort of five embryos reduces FROH by approximately 40% on average. Combining these reductions with empirical effect-size estimates, we estimate that for first-cousin couples this strategy could reduce risk of intellectual disability by roughly 35-45% (corresponding to an absolute risk reduction of about 1.8-2.2%) and potentially reduce excess recessive disease burden, while also modestly reducing risk of common diseases such as type 2 diabetes. We outline how existing PGT-A and PGT-M workflows could potentially be extended to report embryo-level FROH and discuss ethical and counseling considerations. Autozygosity-based embryo prioritization offers a principled way to address a component of recessive risk that current variant-centric approaches miss.

12

Large-scale association study identifies lung cancer susceptibility copy number variants and their potential functional role in genetic instability

Xiao, F.; Qin, F.; Luo, X.; Slewitzke, S. E.; Fernandes, G. F.; Johansson, M.; Xiao, X.; Zaridze, D.; Bojesen, S. E.; Shete, S.; Albanes, D.; Aldrich, M. C.; Tardon, A.; Fernandez-Tardon, G.; Le Marchand, L.; Rennert, G.; Bickeböller, H.; Wichmann, H.-E.; Risch, A.; Muley, T.; Rosenberger, A.; Field, J. K.; Davies, M.; Woll, P.; Kiemeney, L. A.; Haugen, A.; Zienolddiny, S.; Lam, S.; Johansson, M.; Grankvist, K.; Schabath, M. B.; Andrew, A.; Lazarus, P.; Arnold, S. M.; Zhu, D.; Brenner, H.; Neuhouser, M. L.; Hung, R. J.; Christiani, D. C.; McKay, J.; Cai, G.; Xia, J.; Amos, C. I.

2026-05-15 genetic and genomic medicine 10.64898/2026.05.11.26352741 medRxiv

Top 0.3%

1.9%

Background: Genome-wide association studies (GWAS) have identified numerous lung cancer susceptibility loci based on single nucleotide polymorphisms (SNPs), yet a substantial proportion of heritability remains unexplained. We therefore evaluated germline copy number variants (CNVs) as an underexplored source of genetic susceptibility and potential contributors to genomic instability in lung cancer. Methods: We conducted a genome-wide analysis of germline CNVs using 19,342 cases and 15,917 controls from the Transdisciplinary Research in Cancer of the Lung (TRICL) consortium, with replication in two independent cohorts. High-confidence CNVs were identified by integrating two CNV callers including PennCNV and modSaRa2. Association analyses were performed using both gene-based and CNV region-based approaches. Polygenic risk scores (PRS) were constructed from top loci, and functional validation was conducted using siRNA-mediated knockdown in lung fibroblast cells. Results: We identified CNVs in four genomic regions (1p36.22, 2q31.2, 6p21.32, and 19q13.32) significantly associated with lung cancer risk. Two loci (1p36.22 and 2q31.2) were consistently supported across both analytical strategies. A CNV-based PRS constructed from key genes (CLCN6, NFE2L2, OPA3, and PSMB8) was significantly associated with lung cancer risk and replicated across independent datasets. Functional assays demonstrated that knockdown of NFE2L2 and OPA3 increased endogenous DNA damage, supporting a role in genomic stability. Conclusions: Germline CNVs contribute to lung cancer susceptibility and may influence carcinogenesis through mechanisms related to genomic instability. Impact: These findings expand the genetic architecture of lung cancer and highlight CNVs as potential biomarkers for improving risk stratification and informing precision prevention strategies.

13

Machine learning methodology using a masked neural network for robust genetic risk score calculation from noisy and missing data

Squires, S.; Weedon, M. N.; Oram, R. A.

2026-05-20 genetic and genomic medicine 10.64898/2026.05.18.25341725 medRxiv

Top 0.4%

1.9%

Purpose: Genetic risk scores (GRSs) are summaries of genetic data that can improve prediction of disease risk and progression. GRSs are increasingly available but rely on high quality input data to produce good output results; with noisy or missing inputs the GRS may be inaccurate. We aimed to develop a method to produce a robust estimate of the GRS when input data is missing, noisy or both. Approach: We developed a neural network approach, named masked-MLP, for robust GRS calculation trained on a set of GRS scores calculated on clean data. The masked-MLP includes additional input data and has noise inserted during training, both of which make the model more robust. Results: A GRS for type 1 diabetes (T1D) calculated on input data with 10% of the data corrupted had a Spearman rank correlation to the clean GRS of 0.669 (0.665-0.674) while the equivalent for the masked-MLP was 0.951 (0.950-0.952). For the same data the area under the receiver operating characteristic curve for separation of T1D from population samples fell from 0.919 (0.904-0.932) to 0.808 (0.787-0.827) for the GRS while the masked-MLP fell to 0.910 (0.895-0.924). Conclusions: The masked-MLP was more robust to noise when calculating a GRS than using standard approaches. Our approach has the potential to ensure both improved research and clinical outcomes due to more reliable GRS calculation.

14

Characterising the Stability of Polygenic Risk Scores: implications for risk stratification

Ferreira, A.; Lind, P. A.; Moody, H.; Hickie, I. B.; Olsen, C. M.; Whiteman, D. C.; Law, M. H.; Siskind, D. J.; Martin, N. G.; Medland, R. C.; Medland, S. E.

2026-05-20 genetic and genomic medicine 10.64898/2026.05.17.26353273 medRxiv

Top 0.4%

1.9%

Polygenic risk scores (PRS) improve progressively as genome-wide association studies (GWAS) increase in sample size and ancestral diversity, yet the effect of successive GWAS releases on individual PRS rankings remains poorly characterised. Here, we quantify how individual PRS rankings change across GWAS releases, whether those changes favour cases over controls, how consistently individuals maintain their relative position, and whether those in high-risk strata retain that classification over time. Using PRS derived from four GWAS releases for bipolar disorder, major depressive disorder, and schizophrenia in three Australian cohorts, we observed widespread bidirectional reclassification that exceeded the theoretical minimum of expected reclassification, and was directionally consistent with case-control status when discriminative performance improved. Rank variability was substantial and uniformly distributed across all levels of risk, rank persistence was limited across releases, and retention of high-risk classifications was variable across disorders and largely accounted for by the inter-release correlation. These findings demonstrate that individual PRS rankings are dynamic and shaped by progressive improvements in effect-size estimates, carrying important implications for PRS-based risk stratification strategies that rely on stable classifications in psychiatric research and clinical practice.

15

Locally adaptive conformal prediction intervals for polygenic score-based phenotype prediction via residual normalization and data-driven stratification

Yun, Y.; Hao, X.; Zhang, Y. D.

2026-05-30 genetic and genomic medicine 10.64898/2026.05.28.26354326 medRxiv

Top 0.4%

1.8%

Quantifying uncertainty in polygenic score (PGS)-based phenotype prediction is crucial for the integration of genomic data into precision medicine. While the PGS provides a fundamental pivot for point estimation, clinical decision-making necessitates the construction of well-calibrated prediction intervals that reliably encompass the true phenotypic values. However, phenotypic residuals are frequently characterized by complex heteroscedasticity and stratified variance structures across diverse demographic contexts. Existing approaches often rely on global calibration mechanisms, which fail to account for such localized variance structures and lead to systematic miscalibration within specific subpopulations. To bridge this gap, we propose Clustering-based Split Conformal Prediction with Normalized Residuals (C-SCNR), a versatile framework based on Split Conformal Prediction. By adopting residual normalization and incorporating a repetitive `split-and-cluster` mechanism, C-SCNR dynamically identifies latent error strata and applies fine-grained adjustments to the resulting intervals. Our framework requires no distributional assumptions regarding the phenotype, is compatible with any PGS method, and flexibly accommodates biologically-informed grouping. Simulation studies demonstrate that our framework consistently outperforms existing methods across diverse error distributions. In real-data applications analyzing Body mass index (BMI), Low-density lipoprotein (LDL) cholesterol, and High-density lipoprotein (HDL) cholesterol in the UK Biobank, C-SCNR effectively resolves the coverage deficiencies of existing methods in specific subgroups and consistently yields superior localized calibration. Overall, C-SCNR represents a flexible and powerful framework for constructing high-resolution context-specific prediction intervals, thereby facilitating more reliable clinical interpretations of polygenic risk.
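
The residual-normalization ingredient can be illustrated with a minimal split conformal sketch. It assumes a held-out calibration set with observed phenotypes, PGS-based point predictions, and a per-individual residual scale estimate, and uses the standard finite-sample quantile rule. The clustering and repeated `split-and-cluster` steps of C-SCNR are not reproduced, and all function and variable names are hypothetical.

```python
import math

def conformal_quantile(scores, alpha):
    """The ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this level: interval is unbounded.
        return math.inf
    return sorted(scores)[k - 1]

def normalized_interval(y_cal, pred_cal, scale_cal, pred_new, scale_new, alpha=0.1):
    """Split conformal interval using normalized residuals |y - pred| / scale."""
    scores = [abs(y - p) / s for y, p, s in zip(y_cal, pred_cal, scale_cal)]
    q = conformal_quantile(scores, alpha)
    # Scale the calibrated quantile back up by the new individual's scale.
    return pred_new - q * scale_new, pred_new + q * scale_new

# Toy calibration set: 19 individuals, predictions 0, residual scale 2.
y_cal = [float(i) for i in range(1, 20)]
pred_cal = [0.0] * 19
scale_cal = [2.0] * 19
lo, hi = normalized_interval(y_cal, pred_cal, scale_cal,
                             pred_new=10.0, scale_new=0.5, alpha=0.1)
print(lo, hi)
```

Because the quantile is taken over normalized scores, an individual with a larger estimated residual scale receives a proportionally wider interval. A stratified variant would presumably compute a separate quantile within each identified stratum.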

16

Integrating enriched case data from national laboratory testing with population-based case-control analyses: a novel statistical likelihood-ratio methodology for PS4 applied to 325,345 breast cancer cases and 671,006 controls

Allen, S.; Rowlands, C. F.; Garrett, A.; Couch, F.; Richardson, M. E.; Pesaran, T.; Pethick, J.; Lavelle, K.; McRonald, F.; Vernon, S.; Torr, B.; Loong, L.; Aungraheeta, R.; Durkie, M.; Burghel, G. J.; Callaway, A.; Robinson, R.; Field, J.; Frugtniet, B.; Palmer-Smith, S.; Grant, J.; Pagan, J.; McDevitt, T.; Snape, K.; Hanson, H.; McVeigh, T.; Loveday, C.; Jones, M.; Hardy, S.; Turnbull, C.; CanVIG-UK

2026-05-17 genetic and genomic medicine 10.64898/2026.05.13.26353095 medRxiv

Top 0.4%

1.8%

Background: For many evidence criteria within v3.0 of the ACMG/AMP guidelines, methodologies have been developed to empower their use outside the stipulated evidence strengths. However, no such methodology has been established for case-control data (PS4). With the release of large-scale unselected case-control datasets and expansion of nationally-collected laboratory datasets enriched for pathogenic variant carriers, there is potential to combine datasets across ascertainment contexts in a more quantitative manner using novel likelihood ratio tools. Methods: Using our published PS4-LR-Calculator, we calculated a combined log likelihood ratio (PS4-LLR) across five datasets (three unselected, and two enriched), and estimated enrichment of pathogenic variants in clinically-ascertained laboratory data using truncating variant prevalence. Results: Data were combined for 10,817 missense variants from 325,345 female breast cancer patients and 671,006 controls of Western European ancestry for five breast cancer susceptibility genes (BRCA1, BRCA2, PALB2, ATM, CHEK2). A combined LLR was produced for 4,690 missense variants; 927 variants received evidence towards pathogenicity (LLR ≥ 1), and 3,242 received evidence towards benignity (LLR ≤ -1). Conclusion: This flexible, variant-level methodology combines nationally-collected 'enriched' datasets with unselected case-control cohorts, expanding the available information for case-control analysis, boosting power, enabling exploration of atypical penetrance and empowering variant classification.

17

Distinguishing Age-specific Patterns in Comorbidities of Obstructive Sleep Apnea Using Real-World Data

Goodman, M. O.; Alex, R. M.; Sands, S. A.; Azarbarzin, A.; Batool-anwar, S.; Pavlova, M. K.; Epstein, L. J.; Redline, S.; Cade, B. E.

2026-05-28 epidemiology 10.64898/2026.05.20.26352336 medRxiv

Top 0.4%

1.7%

Obstructive sleep apnea (OSA) is associated with a wide range of comorbidities, but the extent to which these follow predictable, age-dependent patterns is not well understood. Identifying such patterns could provide insight into OSA heterogeneity and its links to physiological measures of OSA. We trained age-dependent topic models (ATM) on longitudinal electronic health records from 36,426 patients with OSA in the Mass General Brigham Biobank. ATM organizes incident diagnoses into distinct comorbidity "topics," whose age-specific disease loadings represent predictive patterns linking related diagnoses across the life course. We applied the trained model to compute individual-level topic scores in independent data: a cohort of 11,689 OSA cases and 22,695 matched controls, and a cohort of 6,220 patients with polysomnography (PSG)-derived physiological measures. We identified 19 distinct age-dependent comorbidity profiles, all significantly associated with OSA case status (FDR-adjusted p<0.05). Topics reflected recognizable clusters including metabolic, neuropsychiatric, and immune-mediated conditions, and several were distinguished by age-of-onset of key comorbidities, such as early- vs late-onset asthma. Seventeen of the 19 topics were significantly associated with at least one of 13 PSG-derived physiological measures, including associations between cardiometabolic topics and the apnea-hypopnea index, sleep apnea-specific hypoxic burden, and respiratory event-specific heart rate burden. These findings indicate that age-dependent comorbidity patterns distinguish meaningful OSA subtypes with differing prognoses and endophenotype associations. ATM offers insight into complex OSA comorbidity and suggests that age-informed, topic-based stratification may improve individualized risk assessment, interpretation of PSG findings, and targeting of clinical interventions.

18

Calibrated Prediction Intervals for Polygenic Scores: Updated Comparisons, Contextual Calibration, and Data Normalization

Chang, X.; Hou, S.; Zhou, X.

2026-05-19 genetic and genomic medicine 10.64898/2026.05.15.26353336 medRxiv

Top 0.4%

1.7%

Calibrated prediction intervals for polygenic scores (PGS) are essential for communicating individual-level uncertainty in genomic medicine. We present updated comparisons of two methods for constructing such intervals: CalPred, a parametric approach, and PredInterval, a non-parametric approach. Our results show that both methods can achieve calibrated coverage, although CalPred additionally requires a sufficiently large calibration set. The two methods also exhibit complementary trade-offs with respect to dataset size and risk identification. We further show that contextual calibration, as introduced in Hou et al. and followed in Shi et al., is most naturally achieved through appropriate phenotype normalization and data preprocessing. Apparent miscalibration can arise from inadequate normalization or from providing contextual information to some methods but not others. In UK Biobank, standard GWAS phenotype normalization procedures are sufficient to achieve contextual calibration for the traits analyzed. In the extreme simulations of Hou et al. and Shi et al., supplying contextual covariates to PredInterval restores contextual calibration without normalization, and appropriate normalization can achieve contextual calibration without supplying covariates, while also substantially improving upstream tasks including association power and PGS accuracy. Together, these results underscore the central role of phenotype normalization and data preprocessing in GWAS analyses, including reliable uncertainty quantification for PGS.
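
To make "calibrated coverage" concrete, the sketch below computes the empirical coverage of a set of prediction intervals, which for a calibrated method should be close to the nominal level (for example 0.90). The function name and the toy values are hypothetical, and this is a generic check rather than the procedure used by CalPred or PredInterval.

```python
from typing import Sequence


def empirical_coverage(
    observed: Sequence[float],
    lower: Sequence[float],
    upper: Sequence[float],
) -> float:
    """Fraction of observed phenotypes that fall inside their intervals."""
    if not (len(observed) == len(lower) == len(upper)):
        raise ValueError("inputs must have the same length")
    if len(observed) == 0:
        raise ValueError("inputs must be non-empty")
    hits = sum(1 for y, lo, hi in zip(observed, lower, upper) if lo <= y <= hi)
    return hits / len(observed)


# Hypothetical phenotypes and interval bounds for four individuals.
y = [0.2, 1.5, -0.7, 2.9]
lo = [-1.0, 0.0, -1.5, 0.5]
hi = [1.0, 2.0, 0.5, 2.5]
print(empirical_coverage(y, lo, hi))  # 3 of 4 intervals contain the observation
```

Contextual calibration could be checked in the same way by computing this fraction separately within each context, such as an age or ancestry stratum.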

19

Evaluation of the Contribution of Natural Selection to Greater Cardiometabolic Disease Risk in South Asian Populations

Searby, D. J. C.; Hemani, G.; Chong, A.; Lawson, D. J.; Chaturvedi, N. J.; Davey Smith, G.

2026-05-22 genetic and genomic medicine 10.64898/2026.05.15.26353234 medRxiv

Top 0.4%

1.7%

A greater genetic susceptibility has been proposed as an explanation of the greater rates of cardiovascular and metabolic disease in South Asian relative to European populations. We first demonstrate that, after accounting for technical artefacts, the genetic effects for related traits are largely consistent between ancestral groups, which downplays the role of GxG or GxE interactions in driving differential prevalence. If higher genetic susceptibility in South Asians is due to selective pressures acting through adiposity-related traits in the evolutionary past, signatures of selection should be evident at loci associated with cardiometabolic disease and other causally related traits (e.g. fat distribution). We tested for enrichment of several selection statistics (FST, XP-EHH and XP-nSL) at loci associated with a range of traits related to cardiometabolic disease, in comparison to a null distribution of linkage disequilibrium (LD) score and minor allele frequency (MAF) matched SNPs. Loci associated with a subset of these traits (Type 2 diabetes mellitus, trunk fat percentage, body fat percentage and trunk fat mass) exhibited enrichment for FST, consistent with a moderate adaptive explanation for their cross-population differentiation. In contrast, none of the studied traits were enriched for haplotype-based statistics, indicating that cross-population genetic divergence is unlikely to have been driven by recent selective sweeps but has rather likely arisen from either ancient selection or recent polygenic selection acting on standing variation.
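
An enrichment test against a matched null can be sketched as an empirical p-value: the observed statistic at trait-associated loci is compared with the same statistic computed on many matched SNP sets. The sketch assumes the matched null values have already been computed; the function name and numbers are hypothetical, and the authors' exact test may differ.

```python
from typing import Sequence


def empirical_enrichment_p(observed: float, null_values: Sequence[float]) -> float:
    """One-sided empirical p-value for enrichment.

    `observed` could be, e.g., the mean FST at trait-associated loci, and
    `null_values` the mean FST of each LD-score- and MAF-matched SNP set.
    The +1 in numerator and denominator keeps the estimate away from zero.
    """
    exceed = sum(1 for v in null_values if v >= observed)
    return (exceed + 1) / (len(null_values) + 1)


# Hypothetical values: one of four matched sets is at least as extreme.
print(empirical_enrichment_p(0.9, [0.1, 0.2, 0.95, 0.3]))  # (1 + 1) / (4 + 1)
```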

20

Investigating the Y chromosome in complex disease: Phenome-wide scan across 104,334 Finnish men

Preussner, A.; Leinonen, J. T.; FinnGen; Pirinen, M.; Tukiainen, T.

2026-06-10 genetic and genomic medicine 10.64898/2026.06.09.26355235 medRxiv

Top 0.5%

1.7%

Although the Y chromosome represents roughly 2% of the male genome, it is often ignored in genome-wide association studies (GWAS). Consequently, the potential health impacts of Y-chromosomal genetic variation remain incompletely understood. To fill this gap, we performed a phenome-wide association study (PheWAS) in FinnGen across 1,426 binary and quantitative traits using Y-chromosomal variation (frequency ≥ 1%) in 104,334 genotyped men. As Y chromosome variation is prone to population stratification, we performed carefully adjusted association analyses and further examined these through kin-based validation in 19,275 female and 24,712 male first-degree relatives. We found 121 suggestive (p < 5.6×10⁻³) phenotypic associations in the Y chromosome, yet none of these were strong enough to reach phenome-wide significance (p < 3.9×10⁻⁶). While only 38 associations were supported in the kin-based validation, intriguingly we found support for a previously suggested link between haplogroup I1 and coronary heart disease (CHD; OR=1.06, 95% CI=1.02-1.11, p=3.7×10⁻³; male validation OR=1.05; female validation OR=0.97). The I1-CHD association was detected across distinct geographical areas within Finland and was independent of loss of Y (LOY) and the autosomal risk of CHD, suggesting a link between germline Y-chromosomal variation and heart disease risk. Overall, this study presents a comprehensive phenome-wide analysis of Y-chromosomal associations, highlighting the potential relevance of Y-chromosomal variation beyond sex determination. Our findings further emphasize the need for improved capture of Y-chromosomal variants and further analyses in biobank-scale data to allow for deeper exploration of male-specific genetic architecture of complex diseases.